Nothing corpus construction

User requirements.

I would like to build a corpus of a few hundred verbal expressions about "nothing". The only requirement is that the word "nothing" appears in the conversation.

Solution.

  1. Choose the data source. The easiest way to find enough dialogue is to mine novels, so I downloaded an arbitrary selection of novels from the Internet.

  2. Preliminary cleaning of the data. A Python search over the novels found about 4646 lines of valid data, which is more than enough for a requirement of a few hundred entries.

  3. After viewing sample data, the user found that they could not work out the specific meaning of "nothing" without the surrounding text, so the requirement changed: a method to retrieve the context was needed.

  4. To address this search requirement, all the TXT novel files were eventually merged into one large text. After finding a suitable "nothing" line, the user can search for it in this large text and read the surrounding context.

  5. New SOS from the user. Requirement description: "Can I extract only the "nothing" in inverted commas, so that I can make sure it is in a conversation, or at least verbal? At the moment many of them are in descriptions, such as A sees that B is fine and leaves, or something like that, which does not match "verbal"."

  6. The previously exported "nothing" corpus was cleaned further: the rows containing both "nothing" and inverted commas were matched and exported as a separate file, "Nothing corpus v1.1".
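The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the project's actual source code: the function names (`merge_novels`, `lines_with_nothing`, `spoken_only`, `context`) are hypothetical, the novels are assumed to be UTF-8 `.txt` files in one folder, and only double quotation marks (straight or curly) are treated as inverted commas, because single quotes are hard to tell apart from apostrophes. With `strict=False`, `spoken_only` keeps rows that contain both "nothing" and an inverted comma, as described for v1.1; `strict=True` is a tighter variant that requires "nothing" to sit inside a quoted span.

```python
import re
from pathlib import Path

# Whole-word, case-insensitive match for "nothing".
WORD = re.compile(r"\bnothing\b", re.IGNORECASE)
# A span enclosed in straight or curly double quotes.
QUOTED = re.compile(r'"([^"]*)"|“([^”]*)”')


def merge_novels(folder, merged_path):
    """Concatenate every .txt file in `folder` into one large text file."""
    merged_path = Path(merged_path)
    with merged_path.open("w", encoding="utf-8") as out:
        for path in sorted(Path(folder).glob("*.txt")):
            if path.resolve() == merged_path.resolve():
                continue  # do not merge the output file into itself
            # Assumption: files are UTF-8; undecodable bytes are dropped.
            out.write(path.read_text(encoding="utf-8", errors="ignore"))
            out.write("\n")


def lines_with_nothing(lines):
    """Keep the lines that contain the word "nothing"."""
    return [ln.strip() for ln in lines if WORD.search(ln)]


def spoken_only(lines, strict=False):
    """Keep lines that look like speech.

    strict=False: the line contains "nothing" and any double quotation mark.
    strict=True:  "nothing" must appear inside a quoted span.
    """
    kept = []
    for ln in lines:
        if not WORD.search(ln):
            continue
        if strict:
            spans = [a or b for a, b in QUOTED.findall(ln)]
            if any(WORD.search(span) for span in spans):
                kept.append(ln)
        elif any(mark in ln for mark in '"“”'):
            kept.append(ln)
    return kept


def context(lines, target, radius=2):
    """Return up to `radius` lines before and after the first line equal to `target`."""
    for i, ln in enumerate(lines):
        if ln.strip() == target:
            return lines[max(0, i - radius): i + radius + 1]
    return []
```

A typical run would call `merge_novels` once, read the merged file with `splitlines()`, pass the result through `lines_with_nothing` and then `spoken_only`, and use `context` whenever a particular row needs to be read in its original setting.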

Finally, delivery. The user is overseas, and the files were too large to download over WeChat. Even after switching to Baidu Netdisk, the download speed was only 100 KB/s, so in the end I chose to transfer them through Google Drive: the last stubbornness of a technical person.

The source code and dataset for this project are available at: click here (backup link). Extraction code: AllenMa